release: gastown-staging -> main#3151

Open
jrf0110 wants to merge 11 commits into main from gastown-staging

@jrf0110 jrf0110 commented May 9, 2026

Summary

Promotes 5 commits from gastown-staging to main. Three independent fix groups plus a developer-facing test procedure:

  1. Boot-hydration timeout fix — unblocks /agents/start during container boot hydration and preserves mayor tools on prewarm.
  2. GitHub auth correctness — fresh integration tokens instead of stale stored value, plus distinguished failure messages when no token resolves.
  3. Logging hygiene — redundant request-logging middleware replaced with per-route Hono-param tagging.
  4. Developer tooling — dev-only convoy debug endpoints and a deterministic review-then-land E2E test procedure.

Also lowers TownContainerDO.max_instances from 800 → 500 (as part of commit 1).

Constituent commits

1. Boot hydration + mayor prewarm fix (2ffcef28f, direct push)

Three independent fixes for the startAgentInContainer timeout regression observed after #2974, plus a tighter container-instance cap.

Symptoms. Production logs were filling with two error patterns since the last gastown-staging → main promotion:

[<DOMAIN>] startAgentInContainer: EXCEPTION for agent <UUID>: TimeoutError: The operation was aborted due to timeout
timeout after 6000ms: ensureSDKServer for <agentId>

Root cause. The control server starts accepting requests immediately at boot (main.ts:83), while bootHydration() runs concurrently and serialises every registry agent + the new mayor prewarm through the global sdkServerLock (createKilo reads process.cwd()/process.env). Fresh /agents/start, /refresh-token, and PATCH /agents/:id/model requests queued behind that work and the DO-side AbortSignal.timeout(60s) (resp. REFRESH_AGENT_TIMEOUT_MS=6_000) fired before they ever got the lock.

The mayor prewarm added in #3122 made things worse on two axes:

  1. It built KILO_CONFIG_CONTENT from hardcoded model defaults, so the real /agents/start with the user's actual model triggered ensureSDKServer's "config mismatch — evicting prewarmed server" path on every warm restart, doubling lock-holding time on the critical path the prewarm was supposed to speed up.
  2. It was missing GASTOWN_AGENT_ROLE, GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID from the prewarm env. kilo serve snapshots process.env at spawn, and plugin/index.ts:66 keys mayor-tool registration off GASTOWN_AGENT_ROLE === 'mayor'. Without those, the prewarmed server booted with no mayor tools, and the cache hit on the next /agents/start handed that defective instance back to the user — manifesting as "mayor tools became unavailable."

Changes

1. Hydration gate (control-server.ts, process-manager.ts)

New awaitHydration() exported from process-manager.ts: a promise that bootHydration replaces on entry and resolves in a finally. Awaited at the top of /agents/start, /refresh-token, and PATCH /agents/:id/model (before any process.env mutation in the model PATCH path so concurrent requests can't race on env writes before holding the SDK lock). Default-resolved at module init so test/dev contexts that never run hydration aren't blocked.
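A minimal sketch of the gate described above, assuming a single module-level promise. The function names follow the PR text, but the bodies and the `work` parameter are illustrative assumptions, not the actual process-manager.ts code.

```typescript
// Sketch of a boot-hydration gate. Names mirror the PR description; the
// bodies are assumptions, not the real implementation.

// Default-resolved so contexts that never run hydration are not blocked.
let hydration: Promise<void> = Promise.resolve();

/** Handlers await this before touching shared state or the SDK lock. */
function awaitHydration(): Promise<void> {
  return hydration;
}

/**
 * Replaces the gate on entry and resolves it in a finally, so awaiters are
 * released even if hydration throws. `work` is a hypothetical stand-in for
 * the real hydration steps.
 */
async function bootHydration(work: () => Promise<void>): Promise<void> {
  // The resolver is captured in a local, not a module global.
  let release!: () => void;
  hydration = new Promise<void>((resolve) => {
    release = resolve;
  });
  try {
    await work();
  } finally {
    release();
  }
}
```

A handler would call `await awaitHydration()` as its first statement, before any environment mutation, so that it never competes with hydration for the lock.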

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts, process-manager.ts)

New getMayorPrewarmContext() on TownDO returns { agentId, model, smallModel, kilocodeToken, organizationId } resolved the same way _ensureMayor resolves them (config.resolveModel(townConfig, null, 'mayor')). The /api/towns/:townId/mayor-id endpoint now returns that whole context so the container builds a KILO_CONFIG_CONTENT byte-identical to what the next /agents/start will send. Falls back to the bare { agentId } shape for back-compat; the container skips prewarm when model/token aren't available rather than building a config that's guaranteed to mismatch.
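The skip-or-match behaviour can be sketched roughly as follows. The real KILO_CONFIG_CONTENT layout is not shown in this PR, so the serialised fields and the helper names `buildConfigContent`, `prewarmConfigFor`, and `canReusePrewarm` are hypothetical; requiring `smallModel` is also an assumption.

```typescript
// Hypothetical shapes: the config layout below is illustrative only.
type MayorPrewarmContext = {
  agentId: string;
  model?: string;
  smallModel?: string;
  kilocodeToken?: string;
  organizationId?: string | null;
};

/** Serialise with a fixed key order so equal inputs give equal bytes. */
function buildConfigContent(model: string, smallModel: string): string {
  return JSON.stringify({ model, small_model: smallModel });
}

/**
 * Returns the prewarm config, or null to signal "skip prewarm" when the
 * worker only returned the legacy { agentId } shape.
 */
function prewarmConfigFor(ctx: MayorPrewarmContext): string | null {
  if (!ctx.model || !ctx.smallModel || !ctx.kilocodeToken) return null;
  return buildConfigContent(ctx.model, ctx.smallModel);
}

/** A prewarmed server is reusable only if the configs match exactly. */
function canReusePrewarm(prewarmed: string | null, requested: string): boolean {
  return prewarmed !== null && prewarmed === requested;
}
```

Skipping is the cheaper failure mode here: a mismatched prewarm costs a spawn plus an eviction under the lock, while a skipped prewarm costs only the normal cold start.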

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)

  • Exported ensureMayorWorkspaceForTown(townId) so prewarmMayorSDK materialises the workspace before ensureSDKServer's process.chdir (was throwing ENOENT on cold containers).
  • buildPrewarmEnv now mirrors the mayor-shaped subset of buildAgentEnv: GASTOWN_AGENT_ID, GASTOWN_AGENT_ROLE='mayor', GASTOWN_TOWN_ID, KILOCODE_FEATURE='gastown', KILO_TEST_HOME, XDG_DATA_HOME. New end-to-end test intercepts createKilo and asserts those keys are visible to the spawn.

4. wrangler.jsonc

Lowered TownContainerDO.max_instances from 800 → 500 (manual change).

2. Remove manual request logging middleware (#3158, a6cf1029b)

Removes the redundant request-logging middleware in gastown.worker.ts that logged every request twice (-->/<-- via logger.info) — already covered by the per-route instrumented(c, route, handler) AE event wrapper. Replaces the regex-based logger.setTags block with proper per-route tagging using Hono c.req.param() matching for :orgId / :townId / :rigId / :agentId prefixes. Net diff: ~30 deletions + ~25 additions.
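As a rough illustration of param-based tagging, the sketch below derives tags from a record of named route params. In a Hono handler that record would come from `c.req.param()`; a plain object is used here so the example stands alone. `tagsFromParams` and `TAGGED_PARAMS` are hypothetical names.

```typescript
// Sketch: derive log tags from named route params instead of regex-matching
// the URL. The params record is assumed to come from the router.
const TAGGED_PARAMS = ["orgId", "townId", "rigId", "agentId"] as const;
type TagName = (typeof TAGGED_PARAMS)[number];
type RouteTags = Partial<Record<TagName, string>>;

function tagsFromParams(params: Record<string, string | undefined>): RouteTags {
  const tags: RouteTags = {};
  for (const name of TAGGED_PARAMS) {
    const value = params[name];
    // Only tag params the matched route actually defines.
    if (value) tags[name] = value;
  }
  return tags;
}
```

Because the router has already parsed the path, this avoids keeping a second set of URL regexes in sync with the route table.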

Link: #3158

3. Convoy debug endpoints + E2E test procedure (7f9121ffa, direct push)

Adds three dev-only debug endpoints for autonomous convoy testing without going through the mayor LLM:

  • GET /debug/towns/:townId/rigs — list rigs in a town
  • POST /debug/towns/:townId/sling-convoy — call Town.slingConvoy() directly
  • GET /debug/towns/:townId/convoys — list active convoys with progress

Documents the new endpoints and adds a Test C section to services/gastown/docs/e2e-pr-feedback-testing.md with a deterministic procedure for verifying review-then-land convoys end-to-end (sub-bead PRs into the convoy feature branch, then a landing PR into main). Captures known issues observed during verification: container MTU/TLS handshake failures with github.com, 'failed' blockers not gating dependents, and intermittent polecat skipping of sub-PR creation.

4. Fresh integration tokens for GitHub auth (ce15a6fe7, direct push)

resolveGitHubToken previously preferred git_auth.github_token over the platform integration. Since GitHub App installation tokens have a 1h TTL but git_auth.github_token is only written at rig creation (or rare manual refresh), every long-lived town with an integration was handing out an expired token to:

  • Polecat/refinery gh CLI (via GH_TOKEN derived from GIT_TOKEN in the container), surfacing as "Failed to log in to github.com using token (GH_TOKEN). The token in GH_TOKEN is invalid."
  • The worker-side PR poller (checkPRStatus, checkPRFeedback, mergePR, areThreadsBlocking) — 401 from api.github.com.
  • The /refresh-git-token endpoint the container falls back to on auth failure — it returned the same expired token, so the retry just re-failed.

Fix flips priority to github_cli_pat → live integration → stored github_token (last-resort fallback for towns with no integration). Empty-string responses from the integration service now warn and fall back instead of silently failing. Resolves a fresh token at agent dispatch (startAgentInContainer), merge dispatch (startMergeInContainer), and rig setup (setupRigRepoInContainer) before stuffing GIT_TOKEN into envVars. buildContainerConfig now resolves a fresh token before serializing git_auth.github_token into the X-Town-Config header. Adds 6 unit tests covering the priority chain.
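The new priority order might look roughly like the sketch below. Only `github_cli_pat` and `github_token` are names from the PR; `resolveToken`, the `fetchIntegrationToken` callback, and the result shape are assumptions made for illustration.

```typescript
// Sketch of the priority chain: CLI PAT, then live integration, then the
// stored token as a last resort.
type GitAuth = { github_cli_pat?: string; github_token?: string };
type Resolved =
  | { token: string; source: string }
  | { token: null; tried: string[] };

async function resolveToken(
  auth: GitAuth,
  // Hypothetical stand-in for the integration service lookup; undefined
  // models a town with no integration configured.
  fetchIntegrationToken?: () => Promise<string | null>,
): Promise<Resolved> {
  const tried: string[] = [];

  tried.push("github_cli_pat");
  if (auth.github_cli_pat) {
    return { token: auth.github_cli_pat, source: "github_cli_pat" };
  }

  if (fetchIntegrationToken) {
    tried.push("integration");
    try {
      const fresh = await fetchIntegrationToken();
      // Empty strings are treated as a miss and fall through.
      if (fresh) return { token: fresh, source: "integration" };
    } catch {
      // Lookup failure: fall back to the stored token below.
    }
  }

  tried.push("stored github_token");
  if (auth.github_token) {
    return { token: auth.github_token, source: "stored github_token" };
  }

  return { token: null, tried };
}
```

Recording the sources that were tried is what lets a caller produce a specific "no token" message instead of a bare null.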

5. Distinguish null causes in PR status polling (#3160, 63873e425)

Fixes #3149.

Replace PRStatusResult | null return type with discriminated PRStatusOutcome union in checkPRStatus. Each null cause (no token, HTTP error, invalid response, unrecognized URL, host mismatch) now surfaces a structured PRStatusError with actionable failure messages.

Key changes:

  • resolveGitHubToken returns GitHubTokenResolution with resolution chain tracking which sources were tried (back-compat helper resolveGitHubTokenString exists for non-error-aware callers).
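  • no_token, unrecognized_url, host_mismatch, and non-transient HTTP errors (401/403/404) fail the bead immediately (1 strike). The URL and host cases were moved here from the 3-strike bucket during review because they are deterministic configuration errors.
  • invalid_response fails after 3 strikes.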
  • Transient HTTP errors (5xx/429) keep existing 10-strike behavior.
  • Separate poll_transient_count and poll_non_transient_count counters replace the cross-contaminated single poll_null_count; both reset on a successful poll.
  • failureKind persisted to bead metadata for analytics.
  • AE event pr.poll_failed emitted on terminal failure.
  • resolveGitHubToken tracks the configured integration source even when GIT_TOKEN_SERVICE binding is missing.
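A sketch of the discriminated outcome and the strike limits, using the bucket assignment from the later review commit in this PR (unrecognized_url and host_mismatch fail immediately). The type names follow the PR text; the field layouts, `isTransient`, and `strikeLimit` are assumptions, as is treating every other 4xx status like 401/403/404.

```typescript
// Sketch of a discriminated outcome plus per-kind strike limits.
type PRStatusError =
  | { kind: "no_token"; message: string }
  | { kind: "http_error"; status: number; message: string }
  | { kind: "invalid_response"; message: string }
  | { kind: "unrecognized_url"; message: string }
  | { kind: "host_mismatch"; message: string };

type PRStatusOutcome =
  | { ok: true; state: string } // `state` is a hypothetical success field
  | { ok: false; error: PRStatusError };

function isTransient(error: PRStatusError): boolean {
  return error.kind === "http_error" && (error.status >= 500 || error.status === 429);
}

/** Number of consecutive failures tolerated before the bead fails. */
function strikeLimit(error: PRStatusError): number {
  if (isTransient(error)) return 10;
  if (error.kind === "invalid_response") return 3;
  // no_token, unrecognized_url, host_mismatch, and non-transient HTTP
  // errors cannot self-resolve on retry, so they fail on the first strike.
  return 1;
}
```

With a union like this, a caller that previously saw only `null` can switch on `error.kind` and surface a message specific to the cause.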

Link: #3160

Verification

  • Unit-tested the hydration gate end-to-end with a fetch barrier (asserts awaiters block while bootHydration is in flight, release when it returns).
  • Unit-tested the prewarm env shape end-to-end (drives bootHydration with a /mayor-id fetch mock, intercepts createKilo, asserts GASTOWN_AGENT_ID, GASTOWN_AGENT_ROLE='mayor', GASTOWN_TOWN_ID, GASTOWN_CONTAINER_TOKEN, and a non-empty KILO_CONFIG_CONTENT are all visible at spawn time).
  • Reviewed the _ensureMayor model-resolution path to confirm resolveModel(townConfig, null, 'mayor') is byte-identical to what /agents/start will send (mayor role ignores rigOverride entirely in config.resolveModel).
  • Manual production verification deferred — these changes target a hot path that's hard to reproduce locally; will monitor Sentry / AE mayor.ensure_decision: short_circuit_warm and agent.startup_phase after merge.
  • pnpm --filter cloudflare-gastown typecheck passes.
  • Unit tests for PR polling: test/unit/pr-poll-errors.test.ts (checkPRStatus, resolveGitHubToken), test/unit/pr-poll-thresholds.test.ts (failureMessageFor, shouldFailImmediately, shouldCountAsTransient).
  • Integration test for no_token immediate-fail path: test/integration/pr-poll-errors.test.ts.
  • HTTP error scenarios covered by unit tests (mocking fetch is not practical in Cloudflare Workers integration test runtime).

Visual Changes

N/A

Reviewer Notes

  • The /api/towns/:townId/mayor-id response shape is back-compat: the container's Zod schema (MayorPrewarmResponse) accepts both the new full-context shape and the legacy { agentId } shape with .passthrough(), and falls back to "skip prewarm" on missing fields.
  • The organizationId fallback chain in buildPrewarmEnv distinguishes undefined (older worker, fall back to process.env) from null (worker authoritatively says "no org") so a stale env-var value can't override an authoritative null.
  • The hydration gate is a single global promise — bootHydration is currently single-call from main.ts. If we ever add periodic re-hydration, the resolver capture should move to a local inside bootHydration (called out in code review as a SUGGESTION, deferred).
  • Two SUGGESTION-level findings deferred from code review: (a) prewarmMayorSDK warns but doesn't bail on workdir-mismatch (cheap to harden later), (b) one negative-case timing assertion in the new test relies on a 10ms setTimeout (test still validates the positive case deterministically).
  • The refresh-git-token.handler.ts change is a caller update for the new GitHubTokenResolution return type (was string | null).
  • The wrangler.jsonc max_instances change (800→500) is from the boot hydration commit (2ffcef28f).
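The undefined-versus-null distinction in the organizationId note can be sketched as a three-way branch. `resolveOrganizationId` and its `envOrgId` parameter are hypothetical names; the real logic lives inside buildPrewarmEnv and may differ.

```typescript
// Sketch of the three-way organizationId fallback.
function resolveOrganizationId(
  fromWorker: string | null | undefined,
  envOrgId: string | undefined,
): string | undefined {
  // undefined: an older worker that does not send the field, so the
  // container's existing env value is the best information available.
  if (fromWorker === undefined) return envOrgId;
  // null: the worker authoritatively says "no org"; ignore any stale env.
  if (fromWorker === null) return undefined;
  return fromWorker;
}
```

Collapsing the first two cases with `fromWorker ?? envOrgId` would reintroduce the bug the note warns about, since `??` treats null and undefined alike.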

…ayor tools on prewarm

Three independent fixes for the startAgentInContainer timeout
regression introduced by #2974, plus a tighter container-instance cap.

1. Hydration gate (control-server.ts, process-manager.ts)
   The control server starts accepting requests immediately at boot,
   while bootHydration runs concurrently and serialises every registry
   agent + the mayor prewarm through the global sdkServerLock. Fresh
   /agents/start, /refresh-token, and PATCH /agents/:id/model requests
   queued behind that work and the DO-side AbortSignal.timeout(60s)
   fired before they ever got the lock — surfacing as
   "TimeoutError: aborted due to timeout" and "timeout after 6000ms:
   ensureSDKServer for <agentId>". A new awaitHydration() promise is
   awaited at the top of those handlers (before any process.env
   mutation in the model PATCH path) so they don't compound the queue.

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts,
   process-manager.ts)
   buildPrewarmEnv was constructing KILO_CONFIG_CONTENT from hardcoded
   defaults (anthropic/claude-sonnet-4.6 / claude-haiku-4.5), so the
   real /agents/start with the user's actual model triggered
   ensureSDKServer's "config mismatch, evicting prewarmed server" path
   on every warm restart — doubling lock-holding time on the critical
   path the prewarm was supposed to speed up. The /api/towns/:id/mayor-id
   endpoint now returns the full prewarm context (model, smallModel,
   kilocodeToken, organizationId) resolved the same way _ensureMayor
   resolves it, and the container builds the prewarm KILO_CONFIG_CONTENT
   to match. Falls back gracefully to a skip when the worker hasn't
   deployed the richer endpoint yet.

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)
   prewarmMayorSDK called mayorWorkdirForTown (which only returns a
   string) and went straight to ensureSDKServer's process.chdir,
   throwing ENOENT on cold containers because createMayorWorkspace
   only ran from runAgent. Exported ensureMayorWorkspaceForTown so
   prewarm materialises the workspace first.

   More critically, buildPrewarmEnv was missing GASTOWN_AGENT_ROLE,
   GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID — env vars the kilo serve
   plugin (plugin/index.ts) reads at spawn to decide whether to
   register mayor tools. Without them the prewarmed server booted with
   NO mayor tools, and the cache hit on the next /agents/start handed
   that defective instance back to the user. Now mirrors the mayor-
   shaped subset of buildAgentEnv. Added an end-to-end test that
   intercepts createKilo and asserts the env at spawn time.

4. wrangler.jsonc: lower TownContainerDO max_instances from 800 to 500.

Verified with pnpm --filter gastown-container test (67/67 pass),
pnpm --filter cloudflare-gastown typecheck, oxlint, and pnpm format.
Comment thread services/gastown/container/src/control-server.ts Outdated
Comment thread services/gastown/src/gastown.worker.ts Outdated
Comment thread services/gastown/container/src/process-manager.ts Outdated

kilo-code-bot Bot commented May 9, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

✅ All Previously Flagged Issues Resolved
| File | Issue | Status |
| --- | --- | --- |
| services/gastown/container/src/control-server.ts | process.env.GASTOWN_CONTAINER_TOKEN mutated before awaitHydration() | ✅ Fixed |
| services/gastown/container/src/process-manager.ts | _resolveHydration module-global stale-capture could orphan resolver | ✅ Fixed |
| services/gastown/src/gastown.worker.ts | Double RPC call to TownDO in mayor-id endpoint | ✅ Fixed |
| services/gastown/src/gastown.worker.ts | rigId not tagged for /api/users/:userId/rigs/:rigId routes | ✅ Fixed |
| services/gastown/src/dos/town/actions.ts | Misleading migration comment in poll_non_transient_count branch | ✅ Fixed |
Files Reviewed (all commits)
  • services/gastown/container/src/agent-runner.ts
  • services/gastown/container/src/control-server.ts
  • services/gastown/container/src/process-manager.test.ts
  • services/gastown/container/src/process-manager.ts
  • services/gastown/docs/e2e-pr-feedback-testing.md
  • services/gastown/src/dos/Town.do.ts
  • services/gastown/src/dos/town/actions.ts
  • services/gastown/src/dos/town/config.ts
  • services/gastown/src/dos/town/container-dispatch.ts
  • services/gastown/src/dos/town/town-scm.ts
  • services/gastown/src/gastown.worker.ts
  • services/gastown/src/handlers/refresh-git-token.handler.ts
  • services/gastown/test/integration/pr-poll-errors.test.ts
  • services/gastown/test/unit/pr-poll-errors.test.ts
  • services/gastown/test/unit/pr-poll-thresholds.test.ts
  • services/gastown/test/unit/town-scm.test.ts
  • services/gastown/wrangler.jsonc

Reviewed by claude-sonnet-4.6 · 532,046 tokens

jrf0110 and others added 4 commits May 10, 2026 18:00
* chore(gastown): remove manual request logging middleware

* fix(gastown): unblock /agents/start during boot hydration; preserve mayor tools on prewarm

* feat(gastown): per-route logger tagging via Hono params (review on #3158)

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
…st procedure

Adds three dev-only debug endpoints for autonomous convoy testing without
going through the mayor LLM:

- GET  /debug/towns/:townId/rigs         — list rigs in a town
- POST /debug/towns/:townId/sling-convoy — call Town.slingConvoy() directly
- GET  /debug/towns/:townId/convoys      — list active convoys with progress

Documents the new endpoints and adds a Test C section to
e2e-pr-feedback-testing.md with a deterministic procedure for verifying
review-then-land convoys end-to-end (sub-bead PRs into the convoy feature
branch, then a landing PR into main). Also captures known issues observed
during verification: container MTU/TLS handshake failures with github.com,
'failed' blockers not gating dependents, and intermittent polecat skipping
of sub-PR creation.
… stale stored value

resolveGitHubToken previously preferred git_auth.github_token over the
platform integration. Since GitHub App installation tokens have a 1h
TTL but git_auth.github_token is only written at rig creation (or rare
manual refresh), every long-lived town with an integration was handing
out an expired token to:

- Polecat/refinery 'gh' CLI (via GH_TOKEN derived from GIT_TOKEN in
  the container), surfacing as 'Failed to log in to github.com using
  token (GH_TOKEN). The token in GH_TOKEN is invalid.'
- The worker-side PR poller (checkPRStatus, checkPRFeedback, mergePR,
  areThreadsBlocking) — 401 from api.github.com.
- The /refresh-git-token endpoint the container falls back to on auth
  failure — it returned the same expired token, so the retry just
  re-failed.

Verified by hitting api.github.com with a local town's stored token:
401 even though the integration service mints fresh ones fine.

Fix:
- Flip resolveGitHubToken's priority to github_cli_pat -> live
  integration -> stored github_token (last-resort fallback for towns
  with no integration). Empty-string responses from the integration
  service now warn and fall back instead of silently failing.
- Resolve a fresh token at agent dispatch (startAgentInContainer),
  merge dispatch (startMergeInContainer), and rig setup
  (setupRigRepoInContainer) before stuffing GIT_TOKEN into envVars.
- buildContainerConfig now resolves a fresh token before serializing
  git_auth.github_token into the X-Town-Config header — the container's
  syncTownConfigToProcessEnv path reads this on every request to update
  process.env.GIT_TOKEN, which buildLiveHotSwapEnv then derives GH_TOKEN
  from on token-refresh hot-swaps. townId is required (not optional) so
  a forgotten arg can't silently regress to the stale-token shape.
- syncConfigToContainer resolves a fresh token before persisting
  GIT_TOKEN to DO storage for next boot.

Adds 6 unit tests covering the priority chain (cli_pat preferred,
fresh integration over stale stored, fallback on lookup failure,
rig-level integration ID, no-config returns null).
…3160)

* fix(gastown): distinguish null causes in PR status polling (#3149)

Replace PRStatusResult | null return type with discriminated PRStatusOutcome
union in checkPRStatus. Each null cause (no token, HTTP error, invalid
response, unrecognized URL, host mismatch) now surfaces a structured
PRStatusError with actionable failure messages.

- resolveGitHubToken returns GitHubTokenResolution with resolution chain
- no_token and non-transient HTTP errors (401/403/404) fail immediately
- invalid_response/unrecognized_url/host_mismatch fail after 3 strikes
- Transient HTTP errors (5xx/429) keep existing 10-strike behavior
- poll_null_count resets to 0 on successful poll at both call sites
- failureKind persisted to bead metadata for analytics
- AE event pr.poll_failed emitted on terminal failure
- Unit tests for checkPRStatus, resolveGitHubToken, failureMessageFor,
  and threshold logic
- Integration test for no_token immediate-fail path

* style: apply oxfmt formatting

* fix(gastown): track integration source when GIT_TOKEN_SERVICE unbound (review on town-scm.ts:66)

When integrationId is set but GIT_TOKEN_SERVICE binding is missing,
the configured integration source was silently omitted from the tried
array. Add an else branch that pushes the source label with a
'(GIT_TOKEN_SERVICE not bound)' annotation so the no_token error
message lists all attempted sources.

* fix(gastown): fail immediately for unrecognized_url and host_mismatch (review on actions.ts:374)

Both are deterministic configuration errors that cannot self-resolve
on retry. Move them from the 3-strike bucket to the fail-immediately
bucket alongside no_token and non-transient http_error. Only
invalid_response remains in the 3-strike category.

* fix(gastown): use separate counters for transient vs non-transient poll errors (review on actions.ts:1350)

Replace the shared poll_null_count with poll_transient_count and
poll_non_transient_count. Each error category increments only its own
counter and resets the other, preventing cross-contamination where 9
transient errors followed by 1 non-transient error would incorrectly
fail the bead.

Legacy poll_null_count is migrated on first read: the transient branch
falls back to poll_null_count when poll_transient_count is absent.
This ensures in-flight beads at deploy time retain their existing
counter value. The non-transient branch does not read the legacy field
since these counters reset on every success anyway — at worst an
in-flight bead gets one extra retry for invalid_response.
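The counter split and legacy fallback described in this commit can be sketched in memory as below. The PR indicates the real update happens in SQL, so `recordPollFailure`, `recordPollSuccess`, and the `PollMetadata` shape are illustrative assumptions; dropping the legacy field on write is also an assumption.

```typescript
// In-memory sketch of the two-counter scheme with legacy migration.
type PollMetadata = {
  poll_transient_count?: number;
  poll_non_transient_count?: number;
  poll_null_count?: number; // legacy shared counter
};

function recordPollFailure(meta: PollMetadata, transient: boolean): PollMetadata {
  if (transient) {
    // Migrate on first read: fall back to the legacy counter when the new
    // transient counter is absent.
    const previous = meta.poll_transient_count ?? meta.poll_null_count ?? 0;
    return { poll_transient_count: previous + 1, poll_non_transient_count: 0 };
  }
  // The non-transient branch deliberately ignores the legacy counter.
  const previous = meta.poll_non_transient_count ?? 0;
  return { poll_transient_count: 0, poll_non_transient_count: previous + 1 };
}

function recordPollSuccess(): PollMetadata {
  return { poll_transient_count: 0, poll_non_transient_count: 0 };
}
```

Under this scheme, nine transient errors followed by one non-transient error leave the non-transient counter at 1 rather than pushing a shared counter to 10.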

* fix(gastown): resolve merge conflict in resolveGitHubToken - merge staging priority with PR #3160 structured return type

- resolveGitHubToken now uses staging's priority: cli_pat → integration → stored token
- Returns GitHubTokenResolution discriminated union (from PR #3160)
- Includes unbound-service else branch (GIT_TOKEN_SERVICE not bound)
- Adds resolveGitHubTokenString helper for non-error-aware callers
- Updates Town.do.ts, container-dispatch.ts, config.ts to use helper
- Updates town-scm.test.ts for GitHubTokenResolution return shape
- Updates pr-poll-errors.test.ts for new priority order

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
@jrf0110 jrf0110 changed the title from "fix(gastown): unblock /agents/start during boot hydration; preserve mayor tools on prewarm" to "release: gastown-staging -> main" on May 11, 2026
John Fawcett added 4 commits May 11, 2026 16:21
… awaitHydration in /refresh-token

The /refresh-token handler assigned process.env.GASTOWN_CONTAINER_TOKEN
before awaiting hydration, inconsistent with PATCH /agents/:id/model which
gates first. Mid-hydration token refresh could cause buildPrewarmEnv to
pick up a different token than the one hydration captured locally.
…stead of module global

The _resolveHydration module-global stale-capture pattern would orphan
the first promise's resolver if bootHydration() were ever called
concurrently. Capturing resolve as a local inside bootHydration() itself
eliminates the risk and removes the module-global.
… getMayorPrewarmContext

getMayorPrewarmContext now returns { agentId } even when the kilocode
token is unavailable (instead of null), so the worker route no longer
needs to fall through to getMayorAgentId. This eliminates the redundant
agents.listAgents SQL query over a second RPC hop.
…igId routes

The per-route tagging middleware registered prefixes under
/api/orgs/:orgId/... but missed the parallel /api/users/:userId/rigs/:rigId
family. Without this, requests to those routes lack rigId in structured
log tags.
Comment thread services/gastown/src/dos/town/actions.ts Outdated

jrf0110 commented May 11, 2026

Review observation dispositions

Observation A — "Request/response logging removed without replacement"

Intentional — PR #3158 deletes those manual log lines because instrumented() already emits structured AE events with route, userId, townId, rigId, agentId, beadId, durationMs, and error per route. The new per-route Hono-param tagging middleware preserves the tagging side of the old block. No tracing observability is lost; structured tracing replaces it.

Observation C — "Double GIT_TOKEN_SERVICE.getToken call per agent start"

Acknowledged. The second GIT_TOKEN_SERVICE.getToken is a KV cache hit, so the perf impact is negligible, but the duplication is real — will track as a separate cleanup since deduping requires changing buildContainerConfig's signature beyond what's in scope for this release.

Additional thread resolved

A 4th inline thread about a misleading migration comment in actions.ts was also addressed: the comment on the non-transient poll counter branch claimed poll_null_count migration, but the SQL doesn't include it (correctly, since invalid_response is a new error kind). Fixed in c65fbc2.

Development

Successfully merging this pull request may close these issues.

[Gastown] Misleading "GitHub API returned null" error when town has no GitHub token

1 participant